Gleaning answers from the web∗
Abstract
A wide variety of valuable textual information resides on the Web, but very little of it is in a machine-understandable form such as XML. Instead, the content is usually embedded in HTML markup or other encodings designed for human consumption. The information extraction task is to automatically populate a database with content gleaned from information sources such as Web pages. Wrappers are an important special case of the general information extraction task. A wrapper is a specialized information extraction module tailored to a particular source. For example, a meta-search engine needs a distinct wrapper for each of its underlying search engines.

Wrappers are usually fairly simple pattern-matching programs, because in many applications the documents being processed are highly regular, such as machine-generated HTML emitted by CGI scripts. Nevertheless, automated approaches to wrapper construction are essential if we want to scale up our applications to integrate data from dozens or thousands of sources. Wrapper induction [8, 12, 10] uses machine learning techniques to generate wrappers automatically. The input to a wrapper induction algorithm is a set of training examples (sample documents annotated with the information that should be extracted from each), and the output is a wrapper. My contributions include the identification of several classes of wrappers that are both:

∗Position paper, AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases.
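To make the input/output contract of wrapper induction concrete, here is a minimal sketch in Python. It assumes a deliberately simple wrapper class: one field per record, bracketed by a fixed left and right delimiter string. The function names `induce_wrapper` and `apply_wrapper`, the sample HTML, and the delimiter-learning rule (longest common context around the labeled values) are illustrative assumptions, not the algorithm of the paper or of the cited work.

```python
import os


def common_suffix(strings):
    """Return the longest string that every input string ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn a (left, right) delimiter pair from training examples.

    `examples` is a list of (document, value) pairs, where `value` is the
    text that should be extracted from `document` (hypothetical format).
    """
    lefts, rights = [], []
    for doc, value in examples:
        start = doc.index(value)  # assumption: first occurrence is the labeled one
        lefts.append(doc[:start])
        rights.append(doc[start + len(value):])
    left = common_suffix(lefts)           # shared text just before every value
    right = os.path.commonprefix(rights)  # shared text just after every value
    if not left or not right:
        raise ValueError("no consistent delimiters found in the examples")
    return left, right


def apply_wrapper(wrapper, doc):
    """Extract every substring of `doc` enclosed by the learned delimiters."""
    left, right = wrapper
    results, pos = [], 0
    while True:
        start = doc.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = doc.find(right, start)
        if end == -1:
            break
        results.append(doc[start:end])
        pos = end + len(right)
    return results


# Training examples: sample documents annotated with the value to extract.
training = [
    ("<html><b>Congo</b> <i>242</i></html>", "Congo"),
    ("<p><b>Spain</b> <i>34</i></p>", "Spain"),
]
wrapper = induce_wrapper(training)
extracted = apply_wrapper(
    wrapper, "<ul><li><b>Egypt</b> <i>20</i></li><li><b>Belize</b> <i>501</i></li></ul>"
)
```

The learned wrapper is only as general as the regularity of the training documents; a sketch like this fails as soon as the delimiters vary between pages, which is one reason richer wrapper classes are of interest.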